mm: BPF OOM #10848
Conversation
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug. (AI-authorship-score: low)
Move struct bpf_struct_ops_link's definition into bpf.h, where other custom bpf link definitions are. It's necessary to access its members from outside of the generic bpf_struct_ops implementation, which will be done by following patches in the series.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Introduce the ability to attach bpf struct_ops'es to cgroups. From the user's standpoint it works as follows: the user passes the BPF_F_CGROUP_FD flag and specifies the target cgroup fd when creating a struct_ops link. As a result, the bpf struct_ops link is created and attached to the cgroup.

The cgroup.bpf structure maintains a list of attached struct ops links. If the cgroup is deleted, attached struct ops'es are auto-detached and the userspace program gets a notification.

This change doesn't answer the question of how bpf programs belonging to these struct ops'es will be executed; that will be done individually for every bpf struct ops which supports this. Please note that, unlike "normal" bpf programs, struct ops'es are not propagated to cgroup sub-trees.

Signed-off-by: Roman Gushchin <[email protected]>
bpf_map__attach_struct_ops() returns -EINVAL instead of -ENOMEM on memory allocation failure. Fix it.

Fixes: 590a008 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of bpf_map__attach_struct_ops() which takes an additional struct bpf_struct_ops_opts argument. This makes it possible to pass a target_fd and the BPF_F_CGROUP_FD flag, and thereby attach the struct ops to a cgroup.

Signed-off-by: Roman Gushchin <[email protected]>
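The userspace side of the cgroup attachment could look roughly like the sketch below. The exact layout of struct bpf_struct_ops_opts is an assumption based on the patch description (a flags field carrying BPF_F_CGROUP_FD plus a target_fd); only the function name and the two parameters are taken from the series text.

```c
/* Sketch of attaching a struct_ops map to a cgroup with the proposed
 * bpf_map__attach_struct_ops_opts() API. The opts field names are
 * assumptions until the final API lands upstream. */
#include <bpf/libbpf.h>
#include <fcntl.h>
#include <unistd.h>

int attach_oom_ops_to_cgroup(struct bpf_map *ops_map, const char *cgroup_path)
{
	struct bpf_link *link;
	int cg_fd = open(cgroup_path, O_RDONLY);

	if (cg_fd < 0)
		return -1;

	/* Assumed opts struct from this series: target fd + cgroup flag */
	LIBBPF_OPTS(bpf_struct_ops_opts, opts,
		.flags = BPF_F_CGROUP_FD,
		.target_fd = cg_fd,
	);

	link = bpf_map__attach_struct_ops_opts(ops_map, &opts);
	close(cg_fd);
	if (!link)
		return -1;

	return bpf_link__fd(link);
}
```

Per the description above, the resulting link is tracked in cgroup.bpf and auto-detached if the cgroup is removed.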
Struct oom_control is used to describe the OOM context. Its memcg field defines the scope of the OOM: it's NULL for global OOMs and a valid memcg pointer for memcg-scoped OOMs.

Teach the bpf verifier to recognize it as a trusted-or-NULL pointer. This provides the bpf OOM handler with a trusted memcg pointer, which is required, for example, for iterating over the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Acked-by: Yafang Shao <[email protected]>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation, but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Introduce a bpf struct ops for implementing custom OOM handling policies.

It's possible to load one bpf_oom_ops for the system and one bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the cgroup tree is traversed from the OOM'ing memcg up to the root, and the corresponding BPF OOM handlers are executed until some memory is freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the bpf_handle_out_of_memory() callback, which is expected to return 1 if it was able to free some memory and 0 otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed field of the oom_control structure, which is expected to be set by kfuncs suitable for releasing memory (introduced later in the patch series). If both are set, the OOM is considered handled; otherwise the next OOM handler in the chain is executed: e.g. the BPF OOM attached to the parent cgroup, or the kernel OOM killer.

The bpf_handle_out_of_memory() callback program is sleepable to allow using iterators, e.g. cgroup iterators. The callback receives struct oom_control as an argument, so it can determine the scope of the OOM event: whether it is a memcg-wide or system-wide OOM. It also receives bpf_struct_ops_link as the second argument, so it can detect the cgroup level at which this specific instance is attached.

The bpf_handle_out_of_memory() callback is executed just before the kernel victim task selection algorithm, so all heuristics and sysctls like panic on oom and sysctl_oom_kill_allocating_task are respected.

The struct ops has a name field, which allows defining a custom name for the implemented policy. It's printed in the OOM report in the oom_handler=<name> format, but only if a bpf handler is invoked.

Signed-off-by: Roman Gushchin <[email protected]>
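A minimal bpf_oom_ops implementation might look like the sketch below. The callback name, the oom_control fields, and the section names follow the descriptions in this series, but are assumptions until the final API lands; the policy logic itself is a placeholder.

```c
/* Sketch of a minimal BPF OOM policy implementing the bpf_oom_ops
 * struct ops described in this series. Field and section names are
 * assumptions based on the patch descriptions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Sleepable struct_ops program: the handler may use iterators etc. */
SEC("struct_ops.s/bpf_handle_out_of_memory")
int BPF_PROG(handle_oom, struct oom_control *oc,
	     struct bpf_struct_ops_link *link)
{
	/* oc->memcg is NULL for global OOMs and a trusted memcg pointer
	 * for memcg-scoped OOMs, per the verifier change in this series. */
	if (!oc->memcg)
		return 0; /* defer to the next handler / kernel OOM killer */

	/* A real policy would pick a victim here and free memory via
	 * kfuncs such as bpf_oom_kill_process(), which set
	 * oc->bpf_memory_freed on success. */

	return 1; /* claim the OOM as handled (kernel also checks
		   * oc->bpf_memory_freed) */
}

SEC(".struct_ops.link")
struct bpf_oom_ops my_oom_ops = {
	.name = "my_policy", /* shown as oom_handler=<name> in OOM reports */
	.handle_out_of_memory = (void *)handle_oom,
};

char LICENSE[] SEC("license") = "GPL";
```

Returning 1 without a kfunc having set bpf_memory_freed would, per the description above, still fall through to the next handler in the chain.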
Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed to be used by BPF OOM programs. It allows killing a process in exactly the same way the OOM killer does: using the OOM reaper, bumping the corresponding memcg and global statistics, respecting memory.oom.group, etc.

On success, it sets the oom_control's bpf_memory_freed field to true, enabling the bpf program to bypass the kernel OOM killer.

Signed-off-by: Roman Gushchin <[email protected]>
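Inside a handler, the kfunc would be used along these lines. The exact prototype is not shown in the series text, so the extern declaration below is a hypothetical sketch.

```c
/* Hypothetical use of the bpf_oom_kill_process() kfunc from a BPF OOM
 * handler; the prototype is an assumption, not the series' final API. */
extern int bpf_oom_kill_process(struct oom_control *oc,
				struct task_struct *task,
				const char *message) __ksym;

static int kill_victim(struct oom_control *oc, struct task_struct *victim)
{
	int err = bpf_oom_kill_process(oc, victim, "bpf-oom");

	/* On success the kfunc sets oc->bpf_memory_freed, so returning 1
	 * from the handler now bypasses the kernel OOM killer. */
	return err ? 0 : 1;
}
```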
Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring an out-of-memory event and triggering the corresponding kernel OOM handling mechanism. It takes a trusted memcg pointer (or NULL for system-wide OOMs) as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM can be declared and handled in the system at once, so if the function is called in parallel to another OOM handling, it bails out with -EBUSY. This mode is suited for global OOMs: any concurrent OOM will likely do the job and release some memory. In blocking mode (which is suited for memcg OOMs) the execution will wait on the oom_lock mutex.

The function is declared as sleepable, which guarantees that it won't be called from an atomic context. This is required by the OOM handling code, which shouldn't be called from a non-blocking context. Handling a memcg OOM almost always requires taking the css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable also guarantees that it can't be called with css_set_lock acquired, so the kernel can't deadlock on it.

To avoid deadlocks on the oom lock, the function is filtered out for bpf oom struct ops programs and for all tracing programs.

Signed-off-by: Roman Gushchin <[email protected]>
Export the tsk_is_oom_victim() helper as a BPF kfunc. It's very useful for avoiding redundant oom kills.

Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Implement a read_cgroup_file() helper to read from cgroup control files, e.g. statistics.

Signed-off-by: Roman Gushchin <[email protected]>
Implement a kselftest for the OOM handling functionality.

The OOM handling policy implemented in BPF is to kill all tasks belonging to the biggest leaf cgroup which doesn't contain unkillable tasks (tasks with oom_score_adj set to -1000). Pagecache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an OOM at the top level, checks that the expected process is killed and verifies the memcg's oom statistics.

The same BPF OOM policy is attached both to a memory cgroup and system-wide. In the first case the program does nothing and returns false, so it's executed a second time, when it properly handles the OOM.

Signed-off-by: Roman Gushchin <[email protected]>
Add a tracepoint to psi_avgs_work(). It can be used to attach a bpf handler which monitors PSI values system-wide or for specific cgroup(s) and potentially performs some action, e.g. declares an OOM.

Signed-off-by: Roman Gushchin <[email protected]>
To allow more efficient filtering of cgroups in the psi work tracepoint handler, add a u64 cgroup_id field to the psi_group structure. For system-wide PSI, 0 is used.

Signed-off-by: Roman Gushchin <[email protected]>
Allow calling bpf_out_of_memory() from a PSI tracepoint to enable PSI-based OOM killer policies.

Signed-off-by: Roman Gushchin <[email protected]>
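Putting the PSI pieces together, a PSI-driven policy could be sketched as below. The tracepoint arguments and the bpf_out_of_memory() prototype are assumptions based on the descriptions in this series; a real policy would also inspect the PSI averages and look up the target memcg instead of declaring a system-wide OOM.

```c
/* Sketch of a PSI-based OOM policy: a handler attached to the
 * psi_avgs_work tracepoint added by this series. Tracepoint and kfunc
 * signatures are assumptions, not the series' final API. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Assumed prototype: trusted memcg pointer (or NULL for a system-wide
 * OOM), page order, and flags such as BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK. */
extern int bpf_out_of_memory(struct mem_cgroup *memcg, int order,
			     u64 flags) __ksym;

const volatile u64 target_cgroup_id; /* set by userspace before load */

SEC("tp_btf/psi_avgs_work")
int BPF_PROG(psi_oom, struct psi_group *group)
{
	/* cgroup_id is 0 for system-wide PSI (added earlier in the series) */
	if (group->cgroup_id != target_cgroup_id)
		return 0;

	/* Placeholder action: declare a system-wide OOM in non-blocking
	 * mode (flags = 0), which bails out with -EBUSY if another OOM
	 * is already being handled. */
	bpf_out_of_memory(NULL, 0, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```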
Include CONFIG_PSI to allow dependent tests to build.

Suggested-by: Song Liu <[email protected]>
Signed-off-by: JP Kobryn <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Add a PSI struct ops test. The test creates a cgroup with two child sub-cgroups, sets up memory.high for one of them and puts a memory-hungry process (initially frozen) in it. The memory-hungry task creates high memory pressure in one memory cgroup, which triggers a PSI event. The PSI BPF handler then declares a memcg oom in the corresponding cgroup.

Signed-off-by: Roman Gushchin <[email protected]>
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339 expired. Closing PR.
Pull request for series with
subject: mm: BPF OOM
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339